
[MRG] ENH: K-Means SMOTE implementation #435

Merged · 42 commits into scikit-learn-contrib:master on Jun 12, 2019

Conversation

@StephanHeijl (Contributor) commented Jun 22, 2018

What does this implement/fix? Explain your changes.

This pull request implements K-Means SMOTE, as described in Oversampling for Imbalanced Learning Based on K-Means and SMOTE by Last et al.

Any other comments?

The density estimation function has been changed slightly from the reference paper, as the power term yielded very large numbers, which caused the weighting to favour a single cluster.
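
For context, a minimal sketch of the paper's density-based cluster weighting, where `exponent` is the power term discussed above (the reference implementation defaults it to the number of features, per the discussion later in this thread). The function name and shape are illustrative only; this is not the code in this PR:

```python
import numpy as np
from scipy.spatial.distance import pdist

def cluster_sampling_weights(minority_clusters, exponent):
    """Weight each filtered cluster by its sparsity, per the paper.

    `minority_clusters` is a list of 2D arrays, each holding the
    minority samples of one filtered cluster (assumed to contain at
    least two samples). With a large `exponent`,
    mean_distance ** exponent explodes and a single cluster ends up
    with nearly all of the weight -- the issue described above.
    """
    sparsity = []
    for X_min in minority_clusters:
        mean_distance = pdist(X_min).mean()   # mean pairwise Euclidean distance
        density = len(X_min) / mean_distance ** exponent
        sparsity.append(1.0 / density)
    sparsity = np.asarray(sparsity)
    return sparsity / sparsity.sum()          # fraction of new samples per cluster
```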

codecov bot commented Jun 22, 2018

Codecov Report

Merging #435 into master will decrease coverage by 2.43%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #435      +/-   ##
==========================================
- Coverage    98.9%   96.47%   -2.44%     
==========================================
  Files          84       85       +1     
  Lines        5202     5329     +127     
==========================================
- Hits         5145     5141       -4     
- Misses         57      188     +131
Impacted Files Coverage Δ
imblearn/over_sampling/tests/test_kmeans_smote.py 100% <100%> (ø)
imblearn/over_sampling/_smote.py 97.82% <100%> (+0.55%) ⬆️
imblearn/over_sampling/__init__.py 100% <100%> (ø) ⬆️
imblearn/utils/estimator_checks.py 90.87% <100%> (-6.22%) ⬇️
imblearn/keras/tests/test_generator.py 8.62% <0%> (-91.38%) ⬇️
imblearn/tensorflow/_generator.py 34.28% <0%> (-65.72%) ⬇️
imblearn/keras/_generator.py 40.35% <0%> (-57.9%) ⬇️
imblearn/over_sampling/tests/test_smote_nc.py 94.3% <0%> (-5.7%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c630df3...c3a1502. Read the comment docs.

@chkoar (Member) commented Jun 22, 2018

@felix-last, as the author and creator of the algorithm, are you willing to review this PR?

@StephanHeijl (Contributor, Author):

@chkoar @felix-last Thanks in advance for taking a look at this. It must be said that my initial research was somewhat poor, which caused me to overlook the existing implementation by the original author. As such, this was programmed without actually looking at that code. I nevertheless thought it valuable to contribute this back to the project.

@felix-last:

Unfortunately I don't have time for a detailed review. Just a few comments after glancing over it:

  • The exponent used for the density computation is actually a hyperparameter of the algorithm described in the paper. In our implementation, it defaults to the number of features, but can be set arbitrarily by the user.
  • In the original implementation, k_neighbors is decreased when it is larger than the number of samples in a given cluster (see the sketch below). This enables the user to specify, for example, k_neighbors=float('inf') to apply SMOTE to interpolate among all samples of a cluster.
  • This implementation does not cover the limit cases of SMOTE and random oversampling described in section 3.2 of the paper.

I'd like to point out that our implementation also complies with the imbalanced-learn API.
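
To illustrate the second point above, a hypothetical helper showing the clamping behaviour described; the name and signature are illustrative, not the original implementation's identifiers:

```python
def effective_k_neighbors(k_neighbors, n_cluster_samples):
    # SMOTE can use at most n_cluster_samples - 1 neighbors within a
    # cluster, so an oversized value (even float('inf')) is clamped.
    return int(min(k_neighbors, n_cluster_samples - 1))

effective_k_neighbors(float('inf'), 30)  # -> 29: interpolate among all samples
effective_k_neighbors(5, 30)             # -> 5: unchanged
```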

@StephanHeijl (Contributor, Author) commented Jun 26, 2018

@felix-last Sincere thanks for looking it over. I'll be happy to make these changes and push them to the project, provided @chkoar would still like it included.

@glemaitre (Member):

@StephanHeijl we are fine with including new methods (I would be thrilled to get something working on categorical data :)), as long as we provide references.
Ping me when you have made the last changes.

@pep8speaks commented Jul 1, 2018

Hello @StephanHeijl! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 1222:5: E303 too many blank lines (2)

Line 60:25: E128 continuation line under-indented for visual indent
Line 61:25: E128 continuation line under-indented for visual indent
Line 69:1: E302 expected 2 blank lines, found 1

Comment last updated at 2019-06-12 13:20:16 UTC

@StephanHeijl changed the title from [WIP] K-Means SMOTE implementation to [MRG] K-Means SMOTE implementation on Jul 12, 2018
@StephanHeijl (Contributor, Author):

@glemaitre I took some time to work out the last kinks and implement the edge cases from the paper; I believe this can now be subjected to a thorough review.

@glemaitre (Member):

OK, I will try to have a look.

@glemaitre (Member):

The PR looks good in general. @chkoar, before carrying out a full review I have an API question.

I am split between:

(1) adding a new kind of SMOTE and adding 3 new parameters (as in this PR);
(2) deprecating kind for SMOTE-SVM and creating 2 new classes, SMOTESVM and SMOTEKMeans.

With (1), I am afraid that we will have too many parameters for the common case, which is painful for the user. With (2), we will have 3 classes instead of 1.

What are your thoughts, @chkoar?

@chkoar (Member) commented Jul 25, 2018

It seems that the SMOTE class is getting too much responsibility. Also, according to this article there are almost 100 published variations of SMOTE. So I believe that option two is the way to go. We have to take care not to duplicate code between the classes.

@glemaitre (Member):

OK so let's go for a base class BaseSMOTE to factorize code as much as possible.

@StephanHeijl I will make a PR, and then you will need to move the code into a new class which will inherit from the base class. It should be a minimal change but quite beneficial (see the sketch below).
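
A rough sketch of the layout being proposed (class names as they eventually landed in imblearn; the bodies are placeholders, not the PR's actual code):

```python
class BaseSMOTE:
    """Shared SMOTE machinery factorized out of the variants."""

    def __init__(self, sampling_strategy='auto', random_state=None,
                 k_neighbors=5):
        self.sampling_strategy = sampling_strategy
        self.random_state = random_state
        self.k_neighbors = k_neighbors

    def _make_samples(self, *args, **kwargs):
        """Interpolate synthetic samples between neighbors."""

class SMOTE(BaseSMOTE):
    """Plain SMOTE: only the common parameters."""

class SVMSMOTE(BaseSMOTE):
    """Variant adding an svm_estimator parameter."""

class KMeansSMOTE(BaseSMOTE):
    """This PR's variant: clusters first, then applies SMOTE within
    sufficiently balanced clusters."""
```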

@StephanHeijl (Contributor, Author):

@glemaitre Will do; once the PR drops I'll integrate it and move the code. I agree that this should make implementing SMOTE variants a lot more accessible.

@glemaitre (Member):

@StephanHeijl I think that you can start to have a look at #440.
I still have a couple of things to update, but basically you just have to inherit from BaseSMOTE and move the implementation into _sample.


svm_estimator : object, optional (default=SVC())
    If ``kind='svm'``, a parametrized :class:`sklearn.svm.SVC`
    classifier can be passed.

n_kmeans_clusters : int, optional (default=10)
Review comment (Member):

I think that we could make this a KMeans estimator instead.
It could be interesting to plug in a MiniBatchKMeans, for instance, which could be quicker to converge and more memory efficient. This could be the default, while still letting people pass a KMeans estimator if interested.

Reply (Contributor, Author):

Completely agree. MiniBatchKMeans is now the default, and a test has been included to ensure that using a "normal" KMeans instance also works.
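
To make the agreed API concrete, a usage sketch; the kmeans_estimator parameter name matches what was eventually merged, but treat the exact defaults as indicative:

```python
from sklearn.cluster import KMeans
from imblearn.over_sampling import KMeansSMOTE

# Default: leaving kmeans_estimator unset creates a MiniBatchKMeans
# internally, which is quicker to converge and memory efficient.
sm = KMeansSMOTE(random_state=0)

# A pre-configured "normal" KMeans instance can be passed instead.
sm = KMeansSMOTE(kmeans_estimator=KMeans(n_clusters=20), random_state=0)

# An int is accepted as shorthand for the number of clusters of the
# internal MiniBatchKMeans.
sm = KMeansSMOTE(kmeans_estimator=20, random_state=0)

# Resampling then follows the usual imbalanced-learn API:
# X_res, y_res = sm.fit_resample(X, y)
```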

@glemaitre (Member):

I merged #440, so you should be able to create the new class.
Once you make that quick change, I can do a full review.

@chkoar (Member) commented Mar 7, 2019

@StephanHeijl thanks. @felix-last @glemaitre do you have time for a review?

@chkoar requested a review from glemaitre on March 7, 2019 15:30
@felix-last left a comment:

As a sanity check, I tried creating some of the toy dataset plots from our paper using this implementation.

For some examples this implementation yields unwanted results. Of course, we couldn't expect results to be exactly the same, but it could also be a hint that something is wrong (or that I'm overlooking something or it's just an unlucky random seed). I tried translating our implementation's default parameters (which I used for the toy plots in the paper) to the corresponding parameters of this implementation.

Comparison: our implementation (left) vs. this implementation (right)

[Screenshot: side-by-side comparison of toy dataset plots, 2019-03-30]

After a quick glance, the code looks good to me.

Feel free to play around with the plotting script. Prerequisites are to install requirements.txt, install this PR's imblearn, and create a config.yml with a results_dir and a dataset_dir. Download the toy datasets into the datasets dir. For comparison, here are all the toy plots obtained from our implementation: kmeans-smote-figures.zip.

@StephanHeijl (Contributor, Author) commented Apr 1, 2019

@felix-last Thank you very much for taking the time to look into this! It appears that I passed a wrong parameter to the _make_samples function. I have resolved this, and the results that I gathered from your (very helpful) plotting script appear to be far more in line with the expected results. I have uploaded the results here: https://github.com/StephanHeijl/imblearn-kmeans-toydatasets . Sample:
[Image: corrected K-Means SMOTE toy dataset plot]

I have updated the tests and the script in accordance with this improvement.

@felix-last left a comment:

That looks more sane! Looks good to me now (but I didn't do a thorough code review).

@chkoar (Member) left a comment:

Apart from my latest comments, the only thing that is missing is some user guide documentation.

Use the parameter ``sampling_strategy`` instead. It will be removed
in 0.6.

"""
Review comment (Member):

At the end of the docstring we could add the reference to the paper and, after that, an interactive example. Check here, for instance.

Reply (Contributor, Author):

I added an interactive example that specifically demonstrates the utility of the KMeansSMOTE class: 3 blobs, with the positive class in the middle, the negative class on the outside, and a single negative sample in the middle blob. The example shows that after resampling no new samples are added in the middle blob. It was inspired by the following toy problem:
[Image: toy problem with three blobs]
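
For reference, a sketch of that example under the setup just described; the numbers and the expected count are illustrative of the shape of the merged docstring example, not a verbatim copy:

```python
import numpy as np
from sklearn.datasets import make_blobs
from imblearn.over_sampling import KMeansSMOTE

# Three blobs: a large positive blob in the middle, two small negative
# blobs on the outside, plus one stray negative sample in the middle.
X, y = make_blobs(n_samples=[100, 800, 100],
                  centers=[(-10, 0), (0, 0), (10, 0)],
                  random_state=0)
y = (y == 1).astype(int)            # middle blob = class 1, outer blobs = class 0
X = np.vstack([X, [[0.0, 0.0]]])    # the lone class-0 sample in the middle blob
y = np.append(y, 0)

X_res, y_res = KMeansSMOTE(random_state=42).fit_resample(X, y)

# The imbalanced middle cluster is filtered out, so all synthetic
# class-0 samples land in the outer blobs and the middle blob keeps
# its single class-0 point.
in_middle = (X_res[:, 0] > -5) & (X_res[:, 0] < 5)
print((y_res[in_middle] == 0).sum())  # expected: 1
```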

@StephanHeijl (Contributor, Author):

@chkoar I have attended to your proposed changes; I hope this covers enough of an interactive example.

@glemaitre added this to the 0.5 milestone on Jun 11, 2019
@chkoar (Member) commented Jun 12, 2019

@StephanHeijl could you please rebase your branch so we can finally merge this PR?

@chkoar changed the title from [MRG] K-Means SMOTE implementation to [MRG] ENH: K-Means SMOTE implementation on Jun 12, 2019
@glemaitre (Member):

I pushed to resolve the conflict. I will review, make a couple of style changes if necessary, and merge it.

@glemaitre merged commit aff4125 into scikit-learn-contrib:master on Jun 12, 2019
@StephanHeijl deleted the kmeans-smote branch on June 12, 2019 15:43